ftp.cs.arizona.edu

home *** CD-ROM | disk | FTP | other *** search

/ ftp.cs.arizona.edu / ftp.cs.arizona.edu.tar / ftp.cs.arizona.edu / icon / newsgrp / group99a.txt / 000123_icon-group-sender _Fri Jun 4 08:03:30 1999.msg < prev next >

Wrap

Internet Message Format | 2000-09-20 | 3KB

Return-Path: <icon-group-sender> Received: (from root@localhost) by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) id IAA19581 for icon-group-addresses; Fri, 4 Jun 1999 08:03:22 -0700 (MST) Message-Id: <199906041503.IAA19581@baskerville.CS.Arizona.EDU> Delivered-To: icon-group@silliac.cs.arizona.edu Date: Fri, 4 Jun 99 00:51:50 -0500 (CDT) From: gep2@terabites.com Subject: Find the longest matching substrings To: icon-group@optima.CS.Arizona.EDU Errors-To: icon-group-errors@optima.CS.Arizona.EDU Status: RO > I am working on a project that will mark the inserted or deleted phases between two strings. I am wondering whether you know any library procedure that will anchor ( or report ) the longest phases that exists in both strings. I'm not aware of any such ready-made routine, although certainly "longest common substring" logic is frequently (and heavily!) used in file compression software (PKZIP for example). > For example, assume that we have two strings: old:="other strings ab cd e and more substrings" new:="some more substrings ab cd ab cd e" > This function should return phases and position in both strings, i.e. [15, 28, "ab cd e"] for "ab cd e" is the longest phase that matched ( it has three continuous matching elements ). Intriguing that you consider the "three matching elements" phrase (only seven characters) 'longer' than the sixteen-character common phrase " more substrings". I guess the criteria here could be most anything you want to make it, but still... :-) Generally the logic to do stuff like this (in good programs, anyhow) is coded at very low levels using very heavily optimized code... since it tends to be very, very compute-intensive (this is especially true as the two strings get longer). Of course, the problem's solution might be VERY different depending on whether you're looking for something that's "elegant" in terms of the source code, versus something that's FAST, versus something that is storage-efficient, or something else. (The rule about what substring is 'longer' also is going to have an effect too!). If I were going to try to make something like this FAST (and especially if one of the strings were relatively constant) I'd probably want to use a hash (or even direct index?) table pre-computed with the offsets of character substrings (I'd probably use two-character adjacent pairs) in one of those strings; this would enable stepping rapidly through the second string and quickly finding at least (only just the) plausible candidates for launching more detailed substring compare (and length computation) operations. This kind of thing would likely be storage-hungry, but that's less of a concern today than it once was. Gordon Peterson http://web2.airmail.net/gep2/ Support the Anti-SPAM Amendment! Join at http://www.cauce.org/ 12/19/98: the day the Conservatives demonstrated their scorn for their fraudulent sham of representative government. Voters, remember it!